Analysis of Used Car Price¶

Lanny Chen¶

Introduction¶

Transportation is an important part of people's lives. According to a Vox article (https://www.vox.com/future-perfect/23578481/how-to-live-without-a-car), only about 8% of US families don't own a vehicle, so for most households the price of a car is a significant budget consideration. Some people choose to buy new cars, while others consider used cars to get a better price. When buying a car, people weigh multiple factors, for example the condition, the car age, the transmission, and so on.

Today, let's dive deeper into the different factors that affect price in a used car dataset, discover the trends, and use a machine learning model to walk through the whole data science life cycle.

Table of Contents¶

  • Imports
  • Data Collection
  • Data Processing
  • Exploratory Analysis & Data Visualization
  • Model: Analysis, Hypothesis Testing, & ML
  • Interpretation: Insight & Policy Decision

Imports ¶

In this data analysis, we will use several Python libraries. The one we use most often is pandas, which provides the powerful DataFrame data structure.
Documentation for these libraries:
Pandas: https://pandas.pydata.org/docs/
Matplotlib: https://matplotlib.org/stable/index.html
Seaborn: https://seaborn.pydata.org/tutorial.html
Numpy: https://numpy.org/doc/
Statsmodels: https://www.statsmodels.org/stable/index.html
Scikit-learn: https://scikit-learn.org/stable/
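As a quick toy illustration of the DataFrame structure pandas provides (invented values, not the car dataset):

```python
import pandas as pd

# A toy DataFrame: each column is a named Series, and rows share an index
toy = pd.DataFrame({
    'price': [6000, 11900, 21000],
    'state': ['az', 'ar', 'fl'],
})
print(toy.shape)            # (rows, columns)
print(toy['price'].mean())  # column-wise operations come for free
```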

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Data Collection ¶

In this section, we will collect the data from Kaggle's Used Car Price dataset, last updated 3 years ago. The dataset consists of many variables related to used cars, including several of the important factors people consider when buying one. Let's collect some data!
Data Source: Used Car Dataset

In [2]:
filename = 'vehicles.csv'
df = pd.read_csv(filename)
df.head()
Out[2]:
id url region region_url price year manufacturer model condition cylinders ... size type paint_color image_url description county state lat long posting_date
0 7222695916 https://prescott.craigslist.org/cto/d/prescott... prescott https://prescott.craigslist.org 6000 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN az NaN NaN NaN
1 7218891961 https://fayar.craigslist.org/ctd/d/bentonville... fayetteville https://fayar.craigslist.org 11900 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN ar NaN NaN NaN
2 7221797935 https://keys.craigslist.org/cto/d/summerland-k... florida keys https://keys.craigslist.org 21000 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN fl NaN NaN NaN
3 7222270760 https://worcester.craigslist.org/cto/d/west-br... worcester / central MA https://worcester.craigslist.org 1500 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN ma NaN NaN NaN
4 7210384030 https://greensboro.craigslist.org/cto/d/trinit... greensboro https://greensboro.craigslist.org 4900 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN nc NaN NaN NaN

5 rows × 26 columns

Data Processing ¶

After gathering the initial dataset, we will clean up the data and rename the columns into a better format for later use in data visualization.
The first step is to choose the columns that will be useful for the analysis. Before making any changes to the dataset, we save a copy of the DataFrame so we always have the original data on hand; this saves us from re-running everything from the beginning if we make a mistake.

In [3]:
copy1 = df.copy()
copy1.head()
Out[3]:
id url region region_url price year manufacturer model condition cylinders ... size type paint_color image_url description county state lat long posting_date
0 7222695916 https://prescott.craigslist.org/cto/d/prescott... prescott https://prescott.craigslist.org 6000 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN az NaN NaN NaN
1 7218891961 https://fayar.craigslist.org/ctd/d/bentonville... fayetteville https://fayar.craigslist.org 11900 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN ar NaN NaN NaN
2 7221797935 https://keys.craigslist.org/cto/d/summerland-k... florida keys https://keys.craigslist.org 21000 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN fl NaN NaN NaN
3 7222270760 https://worcester.craigslist.org/cto/d/west-br... worcester / central MA https://worcester.craigslist.org 1500 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN ma NaN NaN NaN
4 7210384030 https://greensboro.craigslist.org/cto/d/trinit... greensboro https://greensboro.craigslist.org 4900 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN nc NaN NaN NaN

5 rows × 26 columns

The goal of our analysis is to see which factors may affect the price, so we keep the columns we think a customer weighs most and drop the rest: region, model, id, url, region_url, VIN, description, image_url, lat, and long. As far as we know, these variables do not directly affect the price of a used car, so we will not consider them here.

In [4]:
copy1.drop(columns=['region','model','id','url','region_url','VIN','description', 'image_url', 'lat', 'long'],inplace=True)
copy1.head()
Out[4]:
price year manufacturer condition cylinders fuel odometer title_status transmission drive size type paint_color county state posting_date
0 6000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN az NaN
1 11900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ar NaN
2 21000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN fl NaN
3 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ma NaN
4 4900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN nc NaN

Next we consider the missing values. If the proportion of missing values is large enough, they can heavily distort the trends in the data. There are several ways to deal with missing values; the approach we use here is to drop the rows that contain them. First, to check the missing values in each column, we print the shape of the DataFrame (its number of rows and columns) and then the total number of missing values per column.
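Before committing to dropping rows, it is worth seeing the main alternatives side by side; a minimal sketch on invented toy data comparing row dropping with mean imputation:

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (illustration only, not the car dataset)
toy = pd.DataFrame({
    'price':    [6000.0, 11900.0, np.nan],
    'odometer': [57923.0, np.nan, 19160.0],
})

dropped = toy.dropna()                            # option 1: drop incomplete rows
filled = toy.fillna(toy.mean(numeric_only=True))  # option 2: impute column means
print(len(dropped), len(filled), filled.isna().sum().sum())
```

Dropping keeps only fully observed rows (one here), while imputation keeps every row at the cost of inventing values; with this dataset's huge row count, dropping is the simpler choice.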

In [5]:
(x,y) = copy1.shape
print('Number of rows is',x)
print('Number of columns is', y)
copy1.isna().sum()
Number of rows is 426880
Number of columns is 16
Out[5]:
price                0
year              1205
manufacturer     17646
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
drive           130567
size            306361
type             92858
paint_color     130203
county          426880
state                0
posting_date        68
dtype: int64

As the counts above show, most of the columns have missing values, so we need to decide which columns to drop based on how much is missing. There are 426,880 rows in total, and condition, cylinders, drive, size, paint_color, and county each have over 100,000 missing rows. Because we want to keep as much data as possible for the analysis, we partition off the columns with more than 30% missing values and continue analyzing the columns that remain. At this point we make another copy of the DataFrame to preserve the previous one.

In [6]:
copy2 = copy1.copy()
copy2.head()
Out[6]:
price year manufacturer condition cylinders fuel odometer title_status transmission drive size type paint_color county state posting_date
0 6000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN az NaN
1 11900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ar NaN
2 21000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN fl NaN
3 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ma NaN
4 4900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN nc NaN
In [7]:
# x is the number of rows; select the columns whose NaN count exceeds 30% of the rows
missing_columns = copy2.columns[copy2.isna().sum()>(x*0.3)]
print(missing_columns.to_numpy())

copy2.drop(columns=missing_columns, inplace=True)
copy2.head()
['condition' 'cylinders' 'drive' 'size' 'paint_color' 'county']
Out[7]:
price year manufacturer fuel odometer title_status transmission type state posting_date
0 6000 NaN NaN NaN NaN NaN NaN NaN az NaN
1 11900 NaN NaN NaN NaN NaN NaN NaN ar NaN
2 21000 NaN NaN NaN NaN NaN NaN NaN fl NaN
3 1500 NaN NaN NaN NaN NaN NaN NaN ma NaN
4 4900 NaN NaN NaN NaN NaN NaN NaN nc NaN

As we can see, copy2 drops the columns 'condition', 'cylinders', 'drive', 'size', 'paint_color', and 'county'. These columns each had more than 30% missing values, and dropping them preserves more entries, since fewer rows now need to be dropped for containing a missing value. Our next step is to drop the remaining NaN values so later use of the data is more efficient.

In [8]:
copy2.dropna(inplace=True)
(x2,y2) = copy2.shape
print("Now the number of rows is: ", x2)
print("Now the number of columns is: ", y2)
copy2.head()
Now the number of rows is:  309826
Now the number of columns is:  10
Out[8]:
price year manufacturer fuel odometer title_status transmission type state posting_date
27 33590 2014.0 gmc gas 57923.0 clean other pickup al 2021-05-04T12:31:18-0500
28 22590 2010.0 chevrolet gas 71229.0 clean other pickup al 2021-05-04T12:31:08-0500
29 39590 2020.0 chevrolet gas 19160.0 clean other pickup al 2021-05-04T12:31:25-0500
30 30990 2017.0 toyota gas 41124.0 clean other pickup al 2021-05-04T10:41:31-0500
31 15000 2013.0 ford gas 128000.0 clean automatic truck al 2021-05-03T14:02:03-0500

Now we have cleaned the problems we could see directly in the dataset. But there is another important issue: the year column. It records the model year of the car, which by itself is not what we want to know when buying a car. For example, if a car made in 1996 sells in 1997 for $30,000, that may be reasonable; if the same car still costs $30,000 in 2021, that is not a reasonable price. In reality, when we buy a used car we look at the model year, but the factor we actually care about is the car age: the gap between the time of purchase and the year the car was made. Therefore, we subtract the model year from the year of the posting date above, giving an approximate car age to pair with each price listing.

In [9]:
copy2['posting_date'] = pd.to_datetime(copy2['posting_date'], format='%Y-%m-%dT%H:%M:%S%z', utc=True)
copy2['Car Age (Yrs)'] = copy2['posting_date'].dt.year - copy2['year']
need_to_drop = ['year', 'posting_date']
copy2.drop(columns=need_to_drop,inplace=True)
copy2.head()
Out[9]:
price manufacturer fuel odometer title_status transmission type state Car Age (Yrs)
27 33590 gmc gas 57923.0 clean other pickup al 7.0
28 22590 chevrolet gas 71229.0 clean other pickup al 11.0
29 39590 chevrolet gas 19160.0 clean other pickup al 1.0
30 30990 toyota gas 41124.0 clean other pickup al 4.0
31 15000 ford gas 128000.0 clean automatic truck al 8.0

After computing the car age, we rename the columns to make the labels and titles easier to produce when plotting multiple variables.

In [10]:
# region, model, lat, and long were already dropped, so only the remaining columns need renaming
copy2.rename(columns={
    'price': 'Price ($)',
    'manufacturer': 'Manufacturer',
    'fuel': 'Fuel',
    'odometer': 'Odometer (Miles)',
    'title_status': 'Title Status',
    'transmission': 'Transmission',
    'type': 'Type',
    'state': 'State',
}, inplace=True)

copy2 = copy2[copy2['Car Age (Yrs)'] >= 0]
copy2 = copy2[copy2['Price ($)'] >= 0 ]
copy2 = copy2.sort_values(by='Car Age (Yrs)', ascending=True)
copy2.reset_index(inplace=True, drop=True)
copy2.head()
Out[10]:
Price ($) Manufacturer Fuel Odometer (Miles) Title Status Transmission Type State Car Age (Yrs)
0 52498 jeep gas 610.0 clean manual pickup oh 0.0
1 27900 chevrolet gas 18320.0 clean automatic SUV ca 0.0
2 0 jeep other 658.0 clean automatic SUV ar 0.0
3 65085 chevrolet diesel 20.0 clean automatic truck mt 0.0
4 88992 ford diesel 212.0 clean automatic pickup tx 0.0

Now we have the final data from the initial cleaning. We have not yet dug into the data itself to get a general idea of its values and the problems it may present, so let's move to the next step of the data science life cycle and figure that out!

Exploratory Analysis & Data Visualization ¶

In the previous section we finished the basic cleaning of the data. Now we can look deeper at how different factors affect the price of used cars using data visualization. Before we start plotting and explaining the data, we do the same thing as before and make a new copy of the DataFrame.

In [11]:
copy3=copy2.copy()
copy3.head()
Out[11]:
Price ($) Manufacturer Fuel Odometer (Miles) Title Status Transmission Type State Car Age (Yrs)
0 52498 jeep gas 610.0 clean manual pickup oh 0.0
1 27900 chevrolet gas 18320.0 clean automatic SUV ca 0.0
2 0 jeep other 658.0 clean automatic SUV ar 0.0
3 65085 chevrolet diesel 20.0 clean automatic truck mt 0.0
4 88992 ford diesel 212.0 clean automatic pickup tx 0.0

To see how the price changes, let's first plot a graph of the car ages. The car age represents the time from the model year to the date the car was listed in the dataset, and the counts here show which car ages appear most often in the used car market.

In [12]:
plt.figure(figsize=(8,6))
plt.hist(copy3['Car Age (Yrs)'], bins=20)
plt.title('Value Counts for the Car Age')
plt.xlabel('Car Age (Yrs)')
plt.ylabel('Frequency')
plt.show()

As we can see, the most frequent car ages are under 10 years. To get a better range, let's do another partition and keep only the cars with a car age of at most 10 years. This shrinks our dataset, but we still keep the largest part of the data by car age.

Within 10 Years¶

In [13]:
copy3 = copy3[(copy3['Car Age (Yrs)'] <= 10) & (copy3['Car Age (Yrs)'] >= 0)]
copy3.reset_index(inplace=True, drop=True)
copy3.head()
Out[13]:
Price ($) Manufacturer Fuel Odometer (Miles) Title Status Transmission Type State Car Age (Yrs)
0 52498 jeep gas 610.0 clean manual pickup oh 0.0
1 27900 chevrolet gas 18320.0 clean automatic SUV ca 0.0
2 0 jeep other 658.0 clean automatic SUV ar 0.0
3 65085 chevrolet diesel 20.0 clean automatic truck mt 0.0
4 88992 ford diesel 212.0 clean automatic pickup tx 0.0

Now, let's make a violin plot of the price against the car age and see what the distribution looks like.

In [14]:
plt.figure(figsize=(8,6))
sns.violinplot(x=copy3['Car Age (Yrs)'], y=copy3['Price ($)'])
plt.title('Price ($) vs. Car Age (Yrs)')
plt.xlabel('Car Age (Yrs)')
plt.ylabel('Price ($)')
plt.show()

The violin plot reveals a problem: the maximum of the price axis is far higher than the bulk of the data we can actually see, and some extremely large outliers appear, meaning certain cars have much larger entry prices than the rest. The first thing to do is check how the values in the price column vary.

In [15]:
temp = copy3['Price ($)'].describe()
temp
Out[15]:
count    2.212990e+05
mean     2.303068e+04
std      2.709840e+05
min      0.000000e+00
25%      1.099900e+04
50%      2.048500e+04
75%      3.159000e+04
max      1.234568e+08
Name: Price ($), dtype: float64

From the summary above, the maximum value in the dataset is vastly bigger than the 75th percentile. So we will test possible cutoffs, starting at the 99th percentile and moving down to the 95th, check which gives the best-looking distribution, and then save that cutoff back to our working DataFrame (here, copy3).

In [16]:
percentile = [0.99, 0.98, 0.97, 0.96, 0.95]
for i in percentile: 
    value = copy3['Price ($)'].quantile(i)
    temp2 = copy3[copy3['Price ($)'] <= value]

    plt.figure(figsize=(6,4))
    sns.violinplot(x=temp2['Car Age (Yrs)'], y=temp2['Price ($)'])
    title = 'Price ($) vs. Car Age (Yrs), Keeping Prices Below the ' + str(round(i * 100)) + 'th Percentile'
    plt.title(title)    
    plt.xlabel('Car Age (Yrs)')
    plt.ylabel('Price ($)')
    plt.show()

As shown in the violin plots above, all the graphs have roughly the same shape, but in every plot after the first, the density extends below a price of 0, which is not what we want since we already dropped the listings with negative prices. As a result, the 99th percentile is good enough for the next section, so let's do one more partition over the price:

In [17]:
value = copy3['Price ($)'].quantile(0.99)
copy3=copy3[copy3['Price ($)'] <= value]
plt.figure(figsize=(8,6))
sns.violinplot(x=copy3['Car Age (Yrs)'], y=copy3['Price ($)'])
plt.title('Price ($) vs. Car Age (Yrs)')
plt.xlabel('Car Age (Yrs)')
plt.ylabel('Price ($)')
plt.show() 

Now let's analyze the median of the violin plot (the white dot in the middle). Except for age 0, which has a lower median price than age 1, the median decreases steadily starting from a car age of 1. Looking at the amount of data, cars with age 0 have more mass concentrated at lower prices; for ages 1 to 10, the values sit closer to their medians and form a trend in which older cars have lower prices.
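The median trend read off the violins can also be checked numerically with a groupby; a minimal sketch on invented toy data that uses the same column names as copy3:

```python
import pandas as pd

# Toy stand-in for copy3 (values invented to mimic the observed pattern)
toy = pd.DataFrame({
    'Car Age (Yrs)': [0, 0, 1, 1, 2, 2, 3, 3],
    'Price ($)':     [5000, 40000, 30000, 28000, 26000, 24000, 22000, 20000],
})

# Median price per car age: age 0 dips below age 1, then the medians decline
medians = toy.groupby('Car Age (Yrs)')['Price ($)'].median()
print(medians)
```

Running the same groupby on copy3 itself would give the exact medians behind the white dots in the plot.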

In [18]:
copy4=copy3.copy()
copy4.head()
Out[18]:
Price ($) Manufacturer Fuel Odometer (Miles) Title Status Transmission Type State Car Age (Yrs)
0 52498 jeep gas 610.0 clean manual pickup oh 0.0
1 27900 chevrolet gas 18320.0 clean automatic SUV ca 0.0
2 0 jeep other 658.0 clean automatic SUV ar 0.0
3 65085 chevrolet diesel 20.0 clean automatic truck mt 0.0
7 274 jeep gas 10.0 clean automatic SUV il 0.0
In [19]:
mean = copy4.groupby("Manufacturer")['Price ($)'].mean().reset_index()
mean = mean.sort_values(by='Price ($)', ascending=False)

plt.figure(figsize=(14,10))
x = sns.barplot(x='Manufacturer', y='Price ($)',data=mean)
x.set_xticklabels(x.get_xticklabels(), rotation=90, fontsize=10)
plt.title('Price ($) vs. Manufacturer')
plt.show() 

mean = copy4.groupby("State")['Price ($)'].mean().reset_index()
mean = mean.sort_values(by='Price ($)', ascending=False)

plt.figure(figsize=(14,10))
x = sns.barplot(x='State', y='Price ($)',data=mean)
x.set_xticklabels(x.get_xticklabels(), rotation=90, fontsize=10)
plt.title('Price ($) vs. State')
plt.show() 

Looking closely at the manufacturers, Ferrari has the highest average price and Land Rover has the lowest.

For the states, Michigan has the highest average car price and Maine has the lowest.

Now let's look at the other variables related to price changes!

In [20]:
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Odometer (Miles)', y='Price ($)', data=copy4)
plt.title('Price ($) vs. Odometer (Miles)')
plt.xlabel('Odometer (Miles)')
plt.ylabel('Price ($)')
plt.show()

Looking at the odometer, we find that it also has some extremely large values that force the scatter to bunch up near 0 miles. Let's apply the same 99th-percentile partition we used for Price to the Odometer variable:

In [21]:
value = copy4['Odometer (Miles)'].quantile(0.99)
copy4=copy4[copy4['Odometer (Miles)'] <= value]
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Odometer (Miles)', y='Price ($)', data=copy4)
sns.regplot(x='Odometer (Miles)', y='Price ($)', data=copy4, scatter=False, color='red')
plt.title('Price ($) vs. Odometer (Miles)')
plt.xlabel('Odometer (Miles)')
plt.ylabel('Price ($)')
plt.show()

For the odometer, the regression line has a decreasing trend as the odometer increases, meaning cars with higher odometer readings sell at lower prices. However, there are also so many points in the plot that it is hard to use for further analysis.

The next step is to look at the relationship between price and the remaining variables. We choose boxplots here because they show the percentiles and the median well.

In [22]:
columns = ['Fuel', 'Title Status', 'Transmission', 'Type']
fig, axs = plt.subplots(2, 2, figsize=(16, 10))
axs = axs.flatten()

for i, j in enumerate(columns): 
    sns.boxplot(x=j, y='Price ($)', data=copy4, ax=axs[i])
    title = j + ' vs. Price ($)'
    axs[i].set_title(title)
    axs[i].set_xlabel(j)
    axs[i].set_ylabel('Price ($)')
    axs[i].set_xticklabels(axs[i].get_xticklabels(), rotation=45, fontsize=10)

plt.subplots_adjust( wspace=0.4, hspace=0.4)
plt.show()
  • For Fuel vs. Price, diesel has the highest median of the fuel types, and hybrid cars have a lower median price than the other types.
  • For Title vs. Price, two title types command higher prices: clean and lien. For more information about liens, see https://mva.maryland.gov/about-mva/Pages/info/27300/27300-13T.aspx#:~:text=A%20lien%20is%20a%20record,does%20not%20repay%20the%20loan.
  • For Transmission vs. Price, the 'other' transmission category has an overall higher price than the other transmission types in the used car market.
  • For the Type of the car, truck and 'other' have the highest prices, and the highest price is not an outlier in its boxplot. Looking at the median price by car type, pickup and truck have similar medians, but the pickup type has more concentrated data than truck. Mini-vans hold the lowest median price of all the car types.

As we can see above, there is still a lot of data remaining in the odometer and car age plots. Without the regression line it would be hard to determine the trend of the dataset. So let's do another partition and see whether it works better than the one above!

Within 5 Years¶

In [23]:
temp = copy2.copy()
temp.head()
Out[23]:
Price ($) Manufacturer Fuel Odometer (Miles) Title Status Transmission Type State Car Age (Yrs)
0 52498 jeep gas 610.0 clean manual pickup oh 0.0
1 27900 chevrolet gas 18320.0 clean automatic SUV ca 0.0
2 0 jeep other 658.0 clean automatic SUV ar 0.0
3 65085 chevrolet diesel 20.0 clean automatic truck mt 0.0
4 88992 ford diesel 212.0 clean automatic pickup tx 0.0

Now we have the dataset as it was before we restricted the car age to at most 10 years and kept 99% of the Odometer data. Let's start over, this time keeping cars with a car age of at most 5 years and keeping 99% of the data by both Odometer and Price.

In [24]:
temp = temp[temp['Car Age (Yrs)'] <= 5]
temp.reset_index(inplace=True, drop=True)
value = temp['Price ($)'].quantile(0.99)
temp=temp[temp['Price ($)'] <= value]
value = temp['Odometer (Miles)'].quantile(0.99)
temp=temp[temp['Odometer (Miles)'] <= value]

plt.figure(figsize=(8,6))
sns.violinplot(x=temp['Car Age (Yrs)'], y=temp['Price ($)'])
plt.title('Price ($) vs. Car Age (Yrs)')
plt.xlabel('Car Age (Yrs)')
plt.ylabel('Price ($)')
plt.show()

As we can now see more clearly, cars with an age of 0 years tend to have lower prices, but the other car ages follow a decreasing trend judging by the medians. Looking more closely at years 1 to 5, the mass of values above the median shrinks as the car age increases: the width above the median becomes smaller.

In [25]:
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Odometer (Miles)', y='Price ($)', data=temp)
sns.regplot(x='Odometer (Miles)', y='Price ($)', data=temp, scatter=False, color='red')
plt.title('Price ($) vs. Odometer (Miles)')
plt.xlabel('Odometer (Miles)')
plt.ylabel('Price ($)')
plt.show()

Even with fewer years we still have a lot of points, but now we can see the trend better, as the points spread more evenly around the regression line fit by seaborn. Observing the plot, there is still more scatter above the line, meaning that at a given odometer reading, more cars are priced higher than the linear relationship between price and odometer would suggest.
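The "more scatter above the line" observation can be quantified by fitting a least-squares line and counting positive residuals; a minimal sketch with numpy on invented toy data (the real check would use temp['Odometer (Miles)'] and temp['Price ($)']):

```python
import numpy as np

# Toy odometer/price arrays (invented values for illustration)
odometer = np.array([1000., 20000., 40000., 60000., 80000., 100000.])
price    = np.array([42000., 35000., 31000., 24000., 21000., 15000.])

slope, intercept = np.polyfit(odometer, price, 1)  # least-squares line fit
residuals = price - (slope * odometer + intercept)
frac_above = float(np.mean(residuals > 0))         # share of listings priced above the line
print(round(slope, 4), round(frac_above, 2))
```

A fraction well above 0.5 on the real data would confirm the visual impression that listings skew above the fitted line.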

In [26]:
mean = temp.groupby("Manufacturer")['Price ($)'].mean().reset_index()
mean = mean.sort_values(by='Price ($)', ascending=False)

plt.figure(figsize=(14,10))
x = sns.barplot(x='Manufacturer', y='Price ($)',data=mean)
x.set_xticklabels(x.get_xticklabels(), rotation=90, fontsize=10)
plt.title('Price ($) vs. Manufacturer')
plt.show() 

mean = temp.groupby("State")['Price ($)'].mean().reset_index()
mean = mean.sort_values(by='Price ($)', ascending=False)

plt.figure(figsize=(14,10))
x = sns.barplot(x='State', y='Price ($)',data=mean)
x.set_xticklabels(x.get_xticklabels(), rotation=90, fontsize=10)
plt.title('Price ($) vs. State')
plt.show() 

Now we can see some clear differences after restricting the car age to at most 5 years.
Maximum & minimum average price by Manufacturer: Land Rover & Ferrari
Maximum & minimum average price by State: West Virginia & Oregon
Car prices by state are hard to explain without going to each state's own data sources and deriving all the figures. When we check the manufacturers, however, recall that with a car age of at most 10 years, Ferrari had the highest average price and Land Rover the lowest. This means that Ferraris with car ages between 6 and 10 years must carry much higher listing prices than other cars to pull Ferrari's overall average to the top.

In [27]:
columns = ['Fuel', 'Title Status', 'Transmission', 'Type']
fig, axs = plt.subplots(2, 2, figsize=(16, 10))
axs = axs.flatten()

for i, j in enumerate(columns): 
    sns.boxplot(x=j, y='Price ($)', data=temp, ax=axs[i])
    title = j + ' vs. Price ($)'
    axs[i].set_title(title)
    axs[i].set_xlabel(j)
    axs[i].set_ylabel('Price ($)')
    axs[i].set_xticklabels(axs[i].get_xticklabels(), rotation=45, fontsize=10)

plt.subplots_adjust( wspace=0.4, hspace=0.4)
plt.show()

Plotting the other variables, we see similar patterns within 10 years of car age and within 5 years.

  • For Fuel vs. Price, diesel has the highest median of the fuel types, and hybrid cars have a lower median price than the other types.
  • For Title vs. Price, two title types command higher prices: clean and lien. For more information about liens, see https://mva.maryland.gov/about-mva/Pages/info/27300/27300-13T.aspx#:~:text=A%20lien%20is%20a%20record,does%20not%20repay%20the%20loan.
  • For Transmission vs. Price, the 'other' transmission category has an overall higher price than the other transmission types in the used car market.
  • For the Type of the car, truck and 'other' have the highest prices, and the highest price is not an outlier in the truck's boxplot. Looking at the median price by car type, truck has the highest median, and the bus boxplot shows no median here.

Model: Analysis, Hypothesis Testing, & ML ¶

Now that we have visualized the variables that may relate to used car price, we choose the odometer and the car age and move to the next step of the data science life cycle. Before doing any machine learning, the first step is a hypothesis test: we fit an OLS regression and use its F-test. We will run two tests, one for car ages within 10 years and one within 5 years.

Within 10 Years¶

In [28]:
copy4.rename(columns={'Odometer (Miles)': 'Odometer'}, inplace=True)
copy4.rename(columns={'Car Age (Yrs)': 'Car_Age'}, inplace=True)
copy4.rename(columns={'Price ($)': 'Price'}, inplace=True)


formula = "Price ~  Odometer + Car_Age "
reg = smf.ols(formula = formula, data= copy4).fit()
reg.summary() 
Out[28]:
OLS Regression Results
Dep. Variable: Price R-squared: 0.177
Model: OLS Adj. R-squared: 0.177
Method: Least Squares F-statistic: 2.337e+04
Date: Sat, 18 May 2024 Prob (F-statistic): 0.00
Time: 18:40:34 Log-Likelihood: -2.3590e+06
No. Observations: 216907 AIC: 4.718e+06
Df Residuals: 216904 BIC: 4.718e+06
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.288e+04 61.108 538.059 0.000 3.28e+04 3.3e+04
Odometer -0.0614 0.001 -74.640 0.000 -0.063 -0.060
Car_Age -1402.2333 14.064 -99.703 0.000 -1429.798 -1374.668
Omnibus: 3043.854 Durbin-Watson: 1.881
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4307.149
Skew: 0.177 Prob(JB): 0.00
Kurtosis: 3.592 Cond. No. 1.78e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.78e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The probability of the F-statistic is around 0.00, so we can reject the null hypothesis that there is no relationship between the explanatory variables and price. That means a linear model should capture some of the relationship, and we can use linear regression for prediction. The next step is the regression itself: let's split the dataset into train and test sets and train the linear regression model!

In [29]:
copy5=copy4.copy()
copy5.head()
Out[29]:
Price Manufacturer Fuel Odometer Title Status Transmission Type State Car_Age
0 52498 jeep gas 610.0 clean manual pickup oh 0.0
1 27900 chevrolet gas 18320.0 clean automatic SUV ca 0.0
2 0 jeep other 658.0 clean automatic SUV ar 0.0
3 65085 chevrolet diesel 20.0 clean automatic truck mt 0.0
7 274 jeep gas 10.0 clean automatic SUV il 0.0
In [30]:
data = copy5[['Odometer', 'Car_Age']]
y = copy5['Price']

X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)

r_squared = model.score(X_test, y_test)
print('R-squared:', r_squared)

plt.figure(figsize=(14, 8))

plt.subplot(1, 2, 1)
sns.scatterplot(x=X_test['Odometer'], y=y_test, label='Actual')
sns.lineplot(x=X_test['Odometer'], y=y_pred, color='red', label='Predicted')
plt.title('Price vs. Odometer' )
plt.xlabel('Odometer (Miles)')
plt.ylabel('Price ($)')
plt.legend()

plt.subplot(1, 2, 2)
sns.scatterplot(x=X_test['Car_Age'], y=y_test, label='Actual')
sns.lineplot(x=X_test['Car_Age'], y=y_pred, color='red', label='Predicted')
plt.title('Price vs. Car Age')
plt.xlabel('Car Age (Years)')
plt.ylabel('Price ($)')
plt.legend()

plt.tight_layout()
plt.show()
Mean Squared Error: 163872015.67078853
R-squared: 0.17894172914892692

First, let's take a look at the MSE: it is about 163,872,016 (an RMSE of roughly $12,800), which means the predictions often differ from the actual prices by a large margin. As we can see in the plot, the regression predictions for Odometer do not fall on a single line but spread out like a band of lines. This happens because the prediction also depends on Car_Age, so cars with the same odometer reading receive a range of predicted prices. Overall, the Price vs. Odometer plot still shows a decreasing trend. For Price vs. Car Age, the regression predictions concentrate more tightly around a straight decreasing line: as car age increases, the predicted price decreases.
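Since MSE is measured in squared dollars, taking its square root (the RMSE) puts the error back into dollars, which is easier to interpret. A tiny sketch with made-up actual and predicted prices:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual vs. predicted prices, just to show the unit conversion
y_true = np.array([20_000, 15_000, 30_000, 8_000])
y_hat = np.array([18_500, 16_200, 27_000, 10_000])

mse = mean_squared_error(y_true, y_hat)   # squared dollars
rmse = np.sqrt(mse)                       # back in dollars
print(f"MSE = {mse:.0f}, RMSE = {rmse:.0f}")
```

An RMSE of around $2,000 here would mean a typical prediction misses the true price by roughly that amount; the same conversion applied to the notebook's MSE gives the "roughly $12,800" figure above.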

Within 5 Years¶

In [31]:
temp.rename(columns={'Odometer (Miles)': 'Odometer'}, inplace=True)
temp.rename(columns={'Car Age (Yrs)': 'Car_Age'}, inplace=True)
temp.rename(columns={'Price ($)': 'Price'}, inplace=True)


formula = "Price ~  Odometer + Car_Age "
reg = smf.ols(formula = formula, data= temp).fit()
reg.summary() 
Out[31]:
OLS Regression Results
Dep. Variable: Price R-squared: 0.071
Model: OLS Adj. R-squared: 0.071
Method: Least Squares F-statistic: 4548.
Date: Sat, 18 May 2024 Prob (F-statistic): 0.00
Time: 18:44:18 Log-Likelihood: -1.3052e+06
No. Observations: 118535 AIC: 2.610e+06
Df Residuals: 118532 BIC: 2.610e+06
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.457e+04 109.352 316.096 0.000 3.44e+04 3.48e+04
Odometer -0.0790 0.002 -46.796 0.000 -0.082 -0.076
Car_Age -1641.1907 38.657 -42.455 0.000 -1716.958 -1565.423
Omnibus: 1277.534 Durbin-Watson: 1.890
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1754.222
Skew: 0.149 Prob(JB): 0.00
Kurtosis: 3.517 Cond. No. 1.33e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.33e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The Prob (F-statistic) here is also around 0.00, so again we can reject the null hypothesis of no linear relationship and fit a linear regression model for prediction. We will repeat the same steps: let's split the dataset into training and test sets, then train the linear regression model!
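One note on the splitting step before we repeat it: `train_test_split` shuffles the rows randomly, so the MSE and R-squared printed below will change on every run unless a `random_state` is fixed. A minimal sketch of a reproducible split on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (10 rows, 2 columns) and target
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Fixing random_state pins the shuffle, so the split is identical on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))  # 7 3
```

Passing the same `random_state` would make the metrics in this notebook reproducible across runs; as written, the cells below give slightly different numbers each time they are re-executed.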

In [32]:
copy6 = temp.copy()
copy6.head()
Out[32]:
Price Manufacturer Fuel Odometer Title Status Transmission Type State Car_Age
0 52498 jeep gas 610.0 clean manual pickup oh 0.0
1 27900 chevrolet gas 18320.0 clean automatic SUV ca 0.0
2 0 jeep other 658.0 clean automatic SUV ar 0.0
3 65085 chevrolet diesel 20.0 clean automatic truck mt 0.0
6 72992 ford diesel 2541.0 clean automatic pickup tx 0.0
In [33]:
data = copy6[['Odometer', 'Car_Age']]
y = copy6['Price']

X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)

r_squared = model.score(X_test, y_test)
print('R-squared:', r_squared)

plt.figure(figsize=(14, 8))

plt.subplot(1, 2, 1)
sns.scatterplot(x=X_test['Odometer'], y=y_test, label='Actual')
sns.lineplot(x=X_test['Odometer'], y=y_pred, color='red', label='Predicted')
plt.title('Price vs. Odometer' )
plt.xlabel('Odometer (Miles)')
plt.ylabel('Price ($)')
plt.legend()

plt.subplot(1, 2, 2)
sns.scatterplot(x=X_test['Car_Age'], y=y_test, label='Actual')
sns.lineplot(x=X_test['Car_Age'], y=y_pred, color='red', label='Predicted')
plt.title('Price vs. Car Age')
plt.xlabel('Car Age (Years)')
plt.ylabel('Price ($)')
plt.legend()

plt.tight_layout()
plt.show()
Mean Squared Error: 211333150.71850312
R-squared: 0.07427232101929437

Again, the predictions for Odometer appear as a band of lines rather than a single line, but the band is narrower here: predicted prices run from around $30,000 down to around $20,000 as the odometer increases. Overall, the plot still shows a decreasing trend of price with mileage. The Price vs. Car Age regression line is closer to a straight line than in the 10-year plot, where a slight bend is visible on close inspection. Overall, Price vs. Car Age still shows a decreasing trend: as the car age increases, the price decreases.
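A compact way to make this 10-year vs. 5-year comparison is to fit one model per age window and compare their R-squared values. A sketch on synthetic stand-in data (the real notebook uses `copy5` and `copy6` for the two subsets):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with the same column names as the notebook
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "Odometer": rng.uniform(0, 150_000, n),
    "Car_Age": rng.uniform(0, 10, n),
})
df["Price"] = (35_000 - 0.08 * df["Odometer"] - 1_400 * df["Car_Age"]
               + rng.normal(0, 8_000, n))

# One model per age window; a narrower window leaves less variance for age to explain
r2 = {}
for max_age in (10, 5):
    sub = df[df["Car_Age"] <= max_age]
    X, y = sub[["Odometer", "Car_Age"]], sub["Price"]
    r2[max_age] = LinearRegression().fit(X, y).score(X, y)
print(r2)
```

This range-restriction effect may partly explain why the 5-year model's R-squared (about 0.074) comes out lower than the 10-year model's (about 0.18) even though both show the same decreasing trends.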

Interpretation: Insight & Policy Decision ¶

After completing the first four stages of the data science life cycle, we move to the last part: Interpretation: Insight & Policy Decision. This section serves as the conclusion.

Based on the observations above, we can find the following:

On the used car market, as a used car's age increases, its price drops. Our observations show a decreasing trend in price over the years, though some cars still sell for much more or much less than the model indicates. The same holds for the odometer: more scatter points sit above the regression line than below it, which means most cars are listed at prices higher than the linear regression model's prediction.